Abstract

This paper addresses the problem of handling spatial misalignments due to camera-view changes or human-pose variations in person re-identification. We first introduce a boosting-based approach to learn a correspondence structure which indicates the patch-wise matching probabilities between images from a target camera pair. The learned correspondence structure can not only capture the spatial correspondence pattern between cameras but also handle the viewpoint or human-pose variation in individual images. We further introduce a global-based matching process. It integrates a global matching constraint over the learned correspondence structure to exclude cross-view misalignments during the image patch matching process, hence achieving a more reliable matching score between images. Experimental results on various datasets demonstrate the effectiveness of our approach.




Figure 1. (a) and (b): Two examples of using a correspondence structure to handle spatial misalignments between images from a camera pair. Images are obtained from the same camera pair: A and B. The colored squares represent sample patches in each image while the lines between images indicate the matching probability between patches (line width is proportional to the probability values). (c): The correspondence structure matrix including all patch matching probabilities between A and B (the matrix is down-sampled for a clearer illustration). (Best viewed in color)


1. Introduction 

Person re-identification (Re-ID) is of increasing importance in visual surveillance. The goal of person Re-ID is to identify a specific person indicated by a probe image from a set of gallery images captured from cross-view cameras (i.e., cameras that are non-overlapping and different from the probe image's camera).^1 It remains challenging due to the large appearance changes across camera views and the interference from background or object occlusion.

One major challenge for person Re-ID is the uncontrolled spatial misalignment between images due to camera-view changes or human-pose variations. For example, in Fig. 1a, the green patch located in the lower part of camera A's image corresponds to patches from the upper part of camera B's image. However, most existing works [25, 11, 12, 7, 8, 9, 22, 19] focus on handling the overall appearance variations between images, while the spatial misalignment among images' local patches is not addressed. Although some patch-based methods [17, 15, 27]

^1 In this paper, an image refers to the pixel region of one person which is cropped from a larger image of a camera view (cf. Fig. 1) [6].


address the spatial misalignment problem by decomposing images into patches and performing an online patch-level matching, their performance is often restrained by the online matching process, which is easily affected by mismatched patches due to similar appearance or occlusion.

In this paper, we argue that due to the stable setting of most cameras (e.g., fixed camera angle or location), each camera has a stable constraint on the spatial configuration of its captured images. For example, images in Figures 1a and 1b are obtained from the same camera pair: A and B. Due to the constraint from the camera angle difference, body parts in camera A's images are located at lower places than those in camera B, implying a lower-to-upper correspondence pattern between them. Meanwhile, constraints from camera locations can also be observed. Camera A (which monitors an exit region) includes more side-view images, while camera B (monitoring a road) shows more front or back-view images. This further results in a high probability of a side-to-front/back correspondence pattern.

Based on this intuition, we propose to learn a correspondence structure (i.e., a matrix including all patch-wise matching probabilities between a camera pair, as in Fig. 1c) to encode the spatial correspondence pattern constrained by a camera pair, and utilize it to guide the patch matching and matching score calculation processes between images. With this correspondence structure, spatial misalignments can be suitably handled and patch matching results are less affected by confusion arising from similar appearance or occlusion. In order for the correspondence structure to model human-pose variations or local viewpoint changes inside a camera view, the correspondence structure for each patch is described by a one-to-many graph whose weights indicate the matching probabilities between patches, as in Fig. 1. Besides, a global constraint is also integrated during the patch matching process, so as to achieve a more reliable matching score between images. Note that our approach is not limited to person re-identification with fixed camera settings. Instead, it can also be applied to capture the camera-and-person configuration and cross-view correspondence for unfixed cameras, as demonstrated in the experimental results.

In summary, our contributions to person Re-ID are threefold. First, we introduce a correspondence structure to encode the cross-view correspondence pattern between cameras, and develop a global-based matching process by combining a global constraint with the correspondence structure to exclude spatial misalignments between images. These two components in fact establish a novel framework for addressing the person Re-ID problem. Second, under this framework, we propose a boosting-based approach to learn a suitable correspondence structure between a camera pair. The learned correspondence structure can not only capture the spatial correspondence pattern between cameras but also handle the viewpoint or human-pose variation in individual images. Third, this paper releases a new and challenging benchmark, the ROAD DATASET, for person Re-ID.

The rest of this paper is organized as follows. Sec. 2 reviews related works. Sec. 3 describes the framework of the proposed approach. Sections 4 and 5 describe the details of our proposed global-based matching process and boosting-based learning approach, respectively. Sec. 6 shows the experimental results and Sec. 7 concludes the paper.

2. Related Works 

Many person re-identification methods have been proposed. Most of them focus on developing suitable feature representations of humans' appearance [25, 11, 12, 7, 14], or finding proper metrics to measure the cross-view appearance similarity between images [8, 9, 22, 19]. Since these works do not effectively model the spatial misalignment among local patches inside images, their performance is often limited due to the interference from viewpoint changes and human-pose variations.

In order to address the spatial misalignment problem, some patch-based methods are proposed [23, 17, 3, 15, 27, 26, 5, 20] which decompose images into patches and perform an online patch-level matching to exclude patch-wise

Figure 2. Framework of the proposed approach.


misalignments. In [23, 3], a human body in an image is first parsed into semantic parts (e.g., head and torso), and similarity matching is then performed between the corresponding semantic parts. Since these methods are highly dependent on the accuracy of the body parser, they have limitations in scenarios where the body parser does not work reliably.

In [17], Oreifej et al. divide images into patches according to appearance consistencies and utilize the Earth Mover's Distance (EMD) to measure the overall similarity among the extracted patches. However, since the spatial correlation among patches is ignored during similarity calculation, the method is easily affected by mismatched patches with similar appearance. Although Ma et al. [15] introduce a body prior constraint to avoid mismatching between distant patches, the problem is still not well addressed, especially for the mismatching between closely located patches.

To reduce the effect of patch-wise mismatching, some saliency-based approaches [27, 26] have recently been proposed, which estimate the saliency distribution relationship between images and utilize it to control the patch-wise matching process. Although these methods consider the correspondence constraint between patches, our approach differs from them in three respects: (1) Our approach focuses on constructing a correspondence structure where patch-wise matching parameters are jointly decided by both matched patches. Comparatively, the matching weights in the saliency-based approach [26] are only controlled by patches in the probe image (probe patches). (2) Our approach models patch-wise correspondence by a one-to-many graph such that each probe patch will trigger multiple matches during the patch matching process. In contrast, the saliency-based approaches only select one best-matched patch for each probe patch. (3) Our approach introduces a global constraint to control the patch-wise matching result, while the patch matching result in saliency-based approaches is locally decided by choosing the best-matched one within a neighborhood set.

3. Overview 

The framework of our approach is shown in Fig. 2. During the training process, which is detailed in Section 5, we present a boosting-based process to learn the correspondence structure between the target camera pair. During the prediction stage, which is detailed in Section 4, given a probe image and a set of gallery images, we use the correspondence structure to evaluate the patch correlations between the probe image and each gallery image, find the optimal one-to-one mapping between patches, and compute the matching score accordingly. The Re-ID result is achieved by ranking gallery images according to their matching scores.

4. Person Re-Identification with Correspondence Structure

In this section, we introduce the concept of correspondence structure, show the scheme of computing the patch correlation using the correspondence structure, and finally present the patch-wise mapping method to compute the matching score between the probe image and the gallery image.

4.1. Correspondence structure 

The correspondence structure, $\Theta_{A,B}$, encodes the spatial correspondence distribution between a pair of cameras, A and B. In our problem, we adopt a discrete distribution, which is a set of patch-wise matching probabilities, $\Theta_{A,B} = \{\mathbf{P}(x_i^A)\}_{i=1}^{N_A}$, where $N_A$ is the number of patches of an image in camera A. $\mathbf{P}(x_i^A) = \{P(x_i^A, x_1^B), P(x_i^A, x_2^B), \ldots, P(x_i^A, x_{N_B}^B)\}$ describes the correspondence distribution in an image from camera B for the $i$th patch $x_i^A$ of an image captured from camera A, where $N_B$ is the number of patches of an image in B. An illustration of the correspondence distribution is shown on the top-right of Fig. 1c.

The matching probabilities in the correspondence structure depend only on the camera pair and are independent of the specific images. In the correspondence structure, one patch in camera A may be highly correlated with multiple patches in camera B, which allows the structure to handle human-pose variations and local viewpoint changes in a camera view.

4.2. Patch correlation 

Given a probe image $U$ in camera A and a gallery image $V$ in camera B, the patch-wise correlation between $U$ and $V$, $C(x_i^U, x_j^V)$, is computed from both the correspondence structure between cameras A and B and the visual features, and is written as:

$$C(x_i^U, x_j^V) = \lambda_{T_c}\big(P(x_i^U, x_j^V)\big) \cdot \log \Phi(f_{x_i^U}, f_{x_j^V}; x_i^U, x_j^V). \qquad (1)$$

Here $x_i^U$ and $x_j^V$ are the $i$th and $j$th patches in images $U$ and $V$; $f_{x_i^U}$ and $f_{x_j^V}$ are the feature vectors for $x_i^U$ and $x_j^V$. $P(x_i^U, x_j^V) = P(x_i^A, x_j^B)$ is the correspondence structure of cameras A and B. $\lambda_{T_c}(P(x_i^U, x_j^V)) = 1$ if $P(x_i^U, x_j^V) > T_c$, and 0 otherwise, where $T_c = 0.05$ is a threshold. $\Phi(f_{x_i^U}, f_{x_j^V}; x_i^U, x_j^V)$ is the correspondence-structure-controlled similarity between $x_i^U$ and $x_j^V$,

$$\Phi(f_{x_i^U}, f_{x_j^V}; x_i^U, x_j^V) = \Phi_z(f_{x_i^U}, f_{x_j^V})\, P(x_i^U, x_j^V), \qquad (2)$$

where $\Phi_z(f_{x_i^U}, f_{x_j^V})$ is the appearance similarity between $x_i^U$ and $x_j^V$.

The correspondence structure $P(x_i^U, x_j^V)$ in Equations 1 and 2 is used to adjust the appearance similarity $\Phi_z(f_{x_i^U}, f_{x_j^V})$ such that a more reliable patch-wise correlation strength can be achieved. The thresholding term $\lambda_{T_c}(P(x_i^U, x_j^V))$ is introduced to exclude patch-wise correlations with a low correspondence probability, which effectively reduces the interference from mismatched patches with similar appearance.

The patch-wise appearance similarity $\Phi_z(f_{x_i^U}, f_{x_j^V})$ in Eq. 2 can be obtained by many off-the-shelf methods [27, 26, 2]. In this paper, we extract Dense SIFT and Dense Color Histogram [27] from each patch and utilize the KISSME distance metric [9] to compute $\Phi_z(f_{x_i^U}, f_{x_j^V})$ (note that we train different KISSME metrics for patch-pairs at different locations).
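As a minimal sketch of how Equations 1 and 2 combine for a single patch pair, the following assumes the appearance similarity and the matching probability are already available as plain floats. The function name `patch_correlation` and its argument names are illustrative and not taken from the paper.

```python
import math

T_C = 0.05  # threshold T_c stated in the text


def patch_correlation(phi_z, p, t_c=T_C):
    """Correlation of one patch pair: Eq. 1 with Eq. 2 substituted.

    phi_z : appearance similarity of the two patches (assumed > 0)
    p     : matching probability of the pair in the correspondence structure
    """
    if p <= t_c:
        # The thresholding term is 0, so the pair contributes nothing.
        return 0.0
    # Thresholding term is 1: log of the structure-controlled similarity.
    return math.log(phi_z * p)


low_prob = patch_correlation(0.8, 0.04)   # excluded by the threshold
kept = patch_correlation(2.0, 0.5)        # log(2.0 * 0.5) = log(1) = 0
```

Note that the logarithm is negative whenever the product of similarity and probability is below 1, so the scale of the appearance similarity matters in practice; the text does not specify that scale, and this sketch leaves it to the caller.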


4.3. Patch-wise mapping 

With $C(x_i^U, x_j^V)$, the alignment-enhanced correlation strength, we can find a best-matched patch in image $V$ for each patch in $U$ and thereby calculate the final image matching score. However, locally finding the largest $C(x_i^U, x_j^V)$ may still create mismatches among patch pairs with high matching probabilities. For example, Fig. 3a shows an image pair $U$ and $V$ containing different people. When locally searching for the largest $C(x_i^U, x_j^V)$, the yellow patch in $U$ will be mismatched to the bold-green patch in $V$ since they have both large appearance similarity and high matching probability. This mismatch unsuitably increases the matching score between $U$ and $V$.

To address this problem, we introduce a global one-to-one mapping constraint and solve the resulting linear assignment task [10] to find the best matching:


$$\Omega^*_{U,V} = \arg\max_{\Omega_{U,V}} \sum_{\{x_i^U, x_j^V\} \in \Omega_{U,V}} C(x_i^U, x_j^V) \qquad (3)$$
$$\text{s.t.}\quad x_i^U \neq x_s^U,\ x_j^V \neq x_t^V \quad \forall\, \{x_i^U, x_j^V\}, \{x_s^U, x_t^V\} \in \Omega_{U,V}$$

where $\Omega^*_{U,V}$ is the set of the best patch matching result between images $U$ and $V$. $\{x_i^U, x_j^V\}$ and $\{x_s^U, x_t^V\}$ are two matched patch pairs in $\Omega_{U,V}$. According to Eq. 3, we want to find the best patch matching result $\Omega^*_{U,V}$ that maximizes the

Figure 3. Patch matching result (a) by locally finding the largest correlation strength $C(x_i^U, x_j^V)$ for each patch and (b) by using a global constraint. The red dashed lines indicate the final patch matching results and the colored solid lines are the matching probabilities in the correspondence structure. (Best viewed in color)


total image matching score

$$\psi_{U,V} = \sum_{\{x_i^U, x_j^V\} \in \Omega^*_{U,V}} C(x_i^U, x_j^V), \qquad (4)$$

given that each patch in $U$ can only be matched to one patch in $V$ and vice versa.

Eq. 3 can be solved by the Hungarian method [10]. Fig. 3b shows an example of the patch matching result by Eq. 3. From Fig. 3b, it is clear that by the inclusion of a global constraint, local mismatches can be effectively reduced and a more reliable image matching score can be achieved. Based on the above process, we can calculate the image matching scores $\psi$ between a probe image and all gallery images in a cross-view camera, and rank the gallery images accordingly to achieve the final Re-ID result [15].
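The difference between local matching and the global one-to-one constraint can be illustrated on a tiny correlation matrix. The sketch below uses an exhaustive search over assignments in place of the Hungarian method; this is a substitution made only to keep the example dependency-free, and it is practical only for a handful of patches. The name `best_one_to_one` and the toy values are hypothetical.

```python
from itertools import permutations


def best_one_to_one(corr):
    """Exhaustively solve the one-to-one assignment on a small matrix.

    corr[i][j] is the correlation between probe patch i and gallery patch j.
    Assumes there are at least as many gallery patches as probe patches.
    Returns (total score, list of (probe index, gallery index) pairs).
    """
    n_u, n_v = len(corr), len(corr[0])
    best_score, best_map = float("-inf"), None
    # Each candidate assigns distinct gallery patches to the probe patches.
    for cols in permutations(range(n_v), n_u):
        score = sum(corr[i][j] for i, j in enumerate(cols))
        if score > best_score:
            best_score, best_map = score, list(enumerate(cols))
    return best_score, best_map


corr = [[0.90, 0.80, 0.00],
        [0.85, 0.10, 0.00]]

# Local matching: both probe patches pick gallery patch 0.
local_score = sum(max(row) for row in corr)

# Global matching: gallery patch 0 can be used only once.
global_score, mapping = best_one_to_one(corr)
```

In this toy case the local score double-counts gallery patch 0, while the constrained solution pairs probe patch 0 with gallery patch 1 and probe patch 1 with gallery patch 0. For realistic patch counts a polynomial-time assignment solver would be needed instead of enumeration.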

5. Correspondence Structure Learning 
5.1. Objective function 

Given a set of probe images $\{U_\alpha\}$ from camera A and their corresponding cross-view images $\{V_\beta\}$ from camera B in the training set, we learn the optimal correspondence structure $\Theta^*_{A,B}$ between cameras A and B so that the correct match image is ranked before the incorrect match images in terms of the matching scores. The formulation is given below:

$$\min_{\Theta_{A,B}} \sum_{U_\alpha} R\big(V_{\alpha'};\ \psi_{U_\alpha, V_{\alpha'}}(\Theta_{A,B}),\ \Psi_{U_\alpha, V_{\beta \neq \alpha'}}(\Theta_{A,B})\big), \qquad (5)$$

where $V_{\alpha'}$ is the correct match gallery image of the probe image $U_\alpha$. $\psi_{U_\alpha, V_{\alpha'}}(\Theta_{A,B})$ (as computed from Eq. 4) is the matching score between $U_\alpha$ and $V_{\alpha'}$, and $\Psi_{U_\alpha, V_{\beta \neq \alpha'}}(\Theta_{A,B})$ is the set of matching scores of all incorrect match images. $R\big(V_{\alpha'};\ \psi_{U_\alpha, V_{\alpha'}}(\Theta_{A,B}),\ \Psi_{U_\alpha, V_{\beta \neq \alpha'}}(\Theta_{A,B})\big)$ is the rank of $V_{\alpha'}$ among all the gallery images according to the matching scores. Intuitively, the penalty is the smallest if the rank is 1, i.e., the matching score of $V_{\alpha'}$ is the greatest. The optimization is not easy as the matching score calculation (Eq. 4) is complicated. We present an approximate solution, a boosting-based process, to solve this problem.

5.2. Boosting-based learning 

The boosting-based approach utilizes a progressive way to find the best correspondence structure with the help of binary mapping structures. A binary mapping structure is similar to the correspondence structure except that it simply utilizes 0 or 1 instead of matching probabilities to indicate the connectivity or linkage between patches, cf. Fig. 4a. It can be viewed as a simplified version of the correspondence structure which includes rough information about the cross-view correspondence pattern.

Since binary mapping structures only include simple connectivity information among patches, their optimal solutions are tractable for individual probe images. Therefore, by searching for the optimal binary mapping structures for different probe images and utilizing them to progressively update the correspondence structure, suitable cross-view correspondence patterns can be achieved.

The entire boosting-based learning process can be described by the following steps as well as Algorithm 1.

Finding the optimal binary mapping structure. For each training probe image $U_\alpha$, we first create multiple candidate binary mapping structures under different search ranges by adjacency-constrained search [27], and then find the optimal binary mapping structure $M_\alpha$ such that the rank order of $U_\alpha$'s correct match image $V_{\alpha'}$ is minimized under $M_\alpha$. Note that we find one optimal binary mapping structure for each probe image such that the obtained binary mapping structures can include local cross-view correspondence information in different training samples.

Correspondence Structure Initialization. In this paper, patch-wise matching probabilities $P(x_i^A, x_j^B)$ in the correspondence structure are initialized by:

$$P^0(x_i^A, x_j^B) \propto \begin{cases} 0, & \text{if } d(x_{\tilde{i}}^B, x_j^B) > T_d \\ \dfrac{1}{d(x_{\tilde{i}}^B, x_j^B) + 1}, & \text{otherwise} \end{cases} \qquad (6)$$

where $x_{\tilde{i}}^B$ is $x_i^A$'s co-located patch in camera B. $d(x_{\tilde{i}}^B, x_j^B)$ is the distance between patches $x_{\tilde{i}}^B$ and $x_j^B$. It is defined as the number of strides to move from $x_{\tilde{i}}^B$ to $x_j^B$ in the zig-zag order. $T_d$ is a threshold which is set to be 32 in this paper. According to Eq. 6, $P^0(x_i^A, x_j^B)$ is inversely proportional to the co-located distance between $x_i^A$ and $x_j^B$ and will be equal to 0 if the distance is larger than a threshold.
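The initialization can be sketched for one probe patch as follows. This is an illustration under several assumptions that the text does not state: gallery patches are indexed in zig-zag order so that the stride distance is the absolute index difference, the additive constant in the denominator is taken to be 1, and each row is normalized to sum to 1 to turn the proportionality into probabilities. The name `init_row` is hypothetical.

```python
T_D = 32  # distance threshold stated in the text


def init_row(colocated_index, n_gallery, t_d=T_D):
    """Initial matching probabilities of one probe patch over gallery patches.

    colocated_index : zig-zag index of the probe patch's co-located gallery patch
    n_gallery       : number of gallery patches
    """
    weights = []
    for j in range(n_gallery):
        d = abs(j - colocated_index)        # assumed zig-zag stride distance
        weights.append(0.0 if d > t_d else 1.0 / (d + 1))
    total = sum(weights)
    # Assumed normalization so that the row is a probability distribution.
    return [w / total for w in weights]


row = init_row(colocated_index=10, n_gallery=100)
```

The co-located patch receives the largest probability, probabilities decay with distance, and patches more than 32 strides away start at exactly 0.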

Binary mapping structure selection. During each iteration $k$ in the learning process, we first apply the correspondence structure $\Theta_{A,B}^{k-1} = \{P^{k-1}(x_i^A, x_j^B)\}$ from the previous iteration to calculate the rank orders of all correct match









Algorithm 1 Boosting-based Learning Process
Input: A set of training probe images $\{U_\alpha\}$ from camera A and their corresponding cross-view images $\{V_\beta\}$ from camera B
Output: $\Theta_{A,B} = \{P(x_i^A, x_j^B)\}$, the correspondence structure between $\{U_\alpha\}$ and $\{V_\beta\}$

1: Find an optimal binary mapping structure $M_\alpha$ for each probe image $U_\alpha$, as described in the 4-th paragraph in Sec. 5.2
2: Set $k = 1$. Initialize $P^0(x_i^A, x_j^B)$ by Eq. 6.
3: Use the current correspondence structure $\{P^{k-1}(x_i^A, x_j^B)\}$ to perform Re-ID on $\{U_\alpha\}$ and $\{V_\beta\}$, and select 20 binary mapping structures based on the Re-ID result, as described in the 6-th paragraph in Sec. 5.2
4: Compute the updated match probability $\hat{P}^k(x_i^A, x_j^B)$ by Eq. 7
5: Update the matching probabilities $P^k(x_i^A, x_j^B)$ by Eq. 12
6: Set $k = k + 1$ and go back to step 3 if not converged or not reaching the maximum iteration number
7: Output $\{P(x_i^A, x_j^B)\}$


images $V_{\alpha'}$ in the training set. Then, we randomly select 20 $V_{\alpha'}$ where half of them are ranked among the top 50% (implying better Re-ID results) and the other half are ranked among the last 50% (implying worse Re-ID results). Finally, we extract binary mapping structures corresponding to these selected images and utilize them to update and boost the correspondence structure.
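The selection step can be sketched as below. The text does not say exactly how the "top 50%" is measured; this sketch assumes the training probes are sorted by the rank of their correct match and split at the midpoint, which is one possible reading. The function name and the dictionary layout are illustrative only.

```python
import random


def select_structures(ranks, n_select=20, rng=random):
    """Pick probe ids whose binary mapping structures feed the next update.

    ranks : dict mapping probe id -> rank of its correct match (1 is best)
    Half of the picks come from the better-ranked half of the probes and
    half from the worse-ranked half (assumed interpretation of the text).
    """
    ordered = sorted(ranks, key=ranks.get)
    half = len(ordered) // 2
    better, worse = ordered[:half], ordered[half:]
    k = n_select // 2
    return (rng.sample(better, min(k, len(better)))
            + rng.sample(worse, min(k, len(worse))))


# Toy example: 100 probes whose correct matches are ranked 1..100.
toy_ranks = {probe_id: probe_id + 1 for probe_id in range(100)}
picked = select_structures(toy_ranks, n_select=20, rng=random.Random(0))
```

Passing a seeded `random.Random` instance makes the selection reproducible, which is convenient when comparing learning runs.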

Note that we select binary mapping structures for both 
high- and low-ranked images in order to include a variety of 
local patch-wise correspondence patterns. In this way, the 
final obtained correspondence structure can suitably handle 
the variations in human-pose or local viewpoints. 

Calculating the updated matching probability. With the introduction of the binary mapping structure $M_\alpha$, we can model the updated matching probability in the correspondence structure by:

$$\hat{P}^k(x_i^A, x_j^B) = \sum_{M_\alpha \in \Gamma_k} \hat{P}(x_i^A, x_j^B \mid M_\alpha) \cdot P(M_\alpha), \qquad (7)$$

where $\hat{P}^k(x_i^A, x_j^B)$ is the updated matching probability between patches $x_i^A$ and $x_j^B$ in the $k$-th iteration. $\Gamma_k$ is the set of binary mapping structures selected in the $k$-th iteration. $P(M_\alpha) = \frac{R_n(M_\alpha)}{\sum_{M_\gamma \in \Gamma_k} R_n(M_\gamma)}$ is the prior probability for binary mapping structure $M_\alpha$, where $R_n(M_\alpha)$ is the CMC score at rank $n$ [21] when using $M_\alpha$ as the correspondence structure to perform person Re-ID over the training images. $n$ is set to be 5 in our experiments.

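$\hat{P}(x_i^A, x_j^B \mid M_\alpha)$ is the updated matching probability between $x_i^A$ and $x_j^B$ when including the local correspondence pattern information of $M_\alpha$. It can be calculated by:

$$\hat{P}(x_i^A, x_j^B \mid M_\alpha) = \hat{P}(x_j^B \mid x_i^A, M_\alpha) \cdot \hat{P}(x_i^A \mid M_\alpha), \qquad (8)$$

$\hat{P}(x_j^B \mid x_i^A, M_\alpha)$ is the updated probability to correspond from $x_i^A$ to $x_j^B$ when including $M_\alpha$, calculated as

$$\hat{P}(x_j^B \mid x_i^A, M_\alpha) \propto \begin{cases} 1, & \text{if } m_{\{x_i^A, x_j^B\}} \in M_\alpha \\ A_{x_j^B \mid x_i^A, M_\alpha}, & \text{otherwise} \end{cases} \qquad (9)$$

where $m_{\{x_i^A, x_j^B\}}$ is a patch-wise link connecting $x_i^A$ and $x_j^B$. $A_{x_j^B \mid x_i^A, M_\alpha} = \frac{\bar{\Phi}_z(x_i^A, x_j^B)}{\bar{\Phi}_z(x_i^A, x_t^B)}$, where $\bar{\Phi}_z(x_i^A, x_j^B)$ is the average appearance similarity [27, 9] between patches $x_i^A$ and $x_j^B$ over all correct match image pairs in the training set. $x_t^B$ is a patch that is connected to $x_i^A$ in the binary mapping structure $M_\alpha$. From Eq. 9, $\hat{P}(x_j^B \mid x_i^A, M_\alpha)$ will be equal to 1 if $M_\alpha$ includes a link between $x_i^A$ and $x_j^B$. Otherwise, $\hat{P}(x_j^B \mid x_i^A, M_\alpha)$ will be decided by the relative appearance similarity strength between patch pair $\{x_i^A, x_j^B\}$ and all patch pairs which are connected to $x_i^A$ in the binary mapping structure $M_\alpha$.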

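Furthermore, $\hat{P}(x_i^A \mid M_\alpha)$ in Eq. 8 is the updated importance probability of $x_i^A$ after including $M_\alpha$. It can be calculated by integrating the importance probability of each individual link in $M_\alpha$:

$$\hat{P}(x_i^A \mid M_\alpha) = \sum_{m_{\{x_s^A, x_t^B\}} \in M_\alpha} \hat{P}(x_i^A \mid m_{\{x_s^A, x_t^B\}}, M_\alpha) \cdot P(m_{\{x_s^A, x_t^B\}} \mid M_\alpha), \qquad (10)$$

where $m_{\{x_s^A, x_t^B\}}$ is a patch-wise link in $M_\alpha$, as the red lines in Fig. 4a. $P(m_{\{x_s^A, x_t^B\}} \mid M_\alpha)$ is the importance probability of link $m_{\{x_s^A, x_t^B\}}$, which is defined similarly to $P(M_\alpha)$:

$$P(m_{\{x_s^A, x_t^B\}} \mid M_\alpha) = \frac{R_n(m_{\{x_s^A, x_t^B\}})}{\sum_{m_{\{x_h^A, x_g^B\}} \in M_\alpha} R_n(m_{\{x_h^A, x_g^B\}})}, \qquad (11)$$

where $R_n(m_{\{x_s^A, x_t^B\}})$ is the rank-$n$ CMC score [21] when only using a single link $m_{\{x_s^A, x_t^B\}}$ as the correspondence structure to perform Re-ID.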

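$\hat{P}(x_i^A \mid m_{\{x_s^A, x_t^B\}}, M_\alpha)$ in Eq. 10 is the impact probability from link $m_{\{x_s^A, x_t^B\}}$ to patch $x_i^A$, defined as:

$$\hat{P}(x_i^A \mid m_{\{x_s^A, x_t^B\}}, M_\alpha) \propto \begin{cases} 0, & \text{if } d(x_i^A, x_s^A) > T_d \\ \dfrac{1}{d(x_i^A, x_s^A) + 1}, & \text{otherwise} \end{cases}$$

where $x_s^A$ is link $m_{\{x_s^A, x_t^B\}}$'s end patch in camera A. $d(\cdot)$ and $T_d$ are the same as in Eq. 6.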

Correspondence structure update. With the updated matching probability $\hat{P}^k(x_i^A, x_j^B)$ in Eq. 7, the matching probabilities in the $k$-th iteration can be finally updated by:

$$P^k(x_i^A, x_j^B) = (1 - \varepsilon) P^{k-1}(x_i^A, x_j^B) + \varepsilon \hat{P}^k(x_i^A, x_j^B), \qquad (12)$$












Figure 4. (a): An example of binary mapping structure (the red lines with weight 1 indicate that the corresponding patches are connected). (b)-(d): Examples of the correspondence structures learned by our approach, where (b)-(c) and (d) are the correspondence structures for the VIPeR [6] and 3DPeS [1] datasets, respectively. The line widths in (b)-(d) are proportional to the patch-wise probability values. (e): The complete correspondence structure matrix of (d) learned by our approach. (f): The correspondence structure matrix of (d)'s dataset obtained by the simple-average method. (Patches in (e) and (f) are organized by a zig-zag scanning order. Matrices in (e) and (f) are down-sampled for a clearer illustration of the correspondence pattern). (Best viewed in color)


where $P^{k-1}(x_i^A, x_j^B)$ is the matching probability in iteration $k - 1$. $\varepsilon$ is the update rate, which is set to 0.2 in our paper.
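The update of Eq. 12 is a convex blend of the previous matching probabilities and the newly estimated ones, applied entry by entry with an update rate of 0.2. A minimal sketch, assuming both structures are stored as equally sized nested lists (a layout chosen here for illustration, with a hypothetical function name):

```python
def update_structure(p_prev, p_hat, rate=0.2):
    """Blend the previous structure with the updated probabilities (Eq. 12).

    p_prev : matching probabilities from iteration k - 1
    p_hat  : updated matching probabilities computed in iteration k
    rate   : update rate (0.2 in the text)
    """
    return [[(1.0 - rate) * prev + rate * new
             for prev, new in zip(row_prev, row_hat)]
            for row_prev, row_hat in zip(p_prev, p_hat)]


p_prev = [[0.5, 0.5]]
p_hat = [[1.0, 0.0]]
p_next = update_structure(p_prev, p_hat)   # approximately [[0.6, 0.4]]
```

Because the blend weights sum to 1, rows that sum to 1 before the update still sum to 1 afterwards, and the influence of the initial structure shrinks by a factor of 0.8 per iteration.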

From Equations 7-12, our update process integrates multiple variables (i.e., binary mapping structure, individual links, patch-link correlation) into a unified probability framework. In this way, various information cues such as appearances, ranking results, and patch-wise correspondence patterns can be effectively included during the model updating process. Besides, although the exact convergence of our learning process is difficult to analyze due to the inclusion of rank score calculation, our experiments show that most correspondence structures become stable within 300 iterations, which implies the reliability of our approach.

Figures 1 and 4 show some examples of the correspondence structures learned from different cross-view datasets. From Figures 1 and 4, we can see that the correspondence structures learned by our approach can suitably indicate the matching correspondence between spatially misaligned patches. For example, in Figures 1 and 4d-4e, the large lower-to-upper misalignments between cameras are effectively captured. Besides, the matching probability values in the correspondence structure also suitably reflect the correlation strength between different patch locations, as displayed by the colored points in Figures 1c and 4e.

Furthermore, comparing Figures 1a and 1b, we can see that the human-pose variation is also suitably handled by the learned correspondence structure. More specifically, although images in Fig. 1 have different human poses, patches of camera A in both figures can correctly find their corresponding patches in camera B since the one-to-many matching probability graphs in the correspondence structure suitably embed the local correspondence variation between cameras. Similar observations can also be obtained from Figures 4b and 4c. It should be noted that images in the dataset of Figures 4b and 4c are taken by unfixed cameras (i.e., cameras with unfixed locations). However, the correspondence structure learned by our approach can still effectively encode the camera-person configuration and capture the cross-view correspondence pattern accordingly.

6. Experimental Results 

We perform experiments on the following four datasets: 

VIPeR. The VIPeR dataset [6] is a commonly used dataset which contains 632 image pairs for 632 pedestrians, as in Figures 4a-4c and 5d. It is one of the most challenging datasets, with large differences in viewpoint, pose, and illumination between two camera views. Images from camera A are mainly captured from 0 to 90 degrees, while images from camera B are mainly captured from 90 to 180 degrees.

PRID 450S. The PRID 450S dataset [19] consists of 
450 person image pairs from two non-overlapping camera 
views. It is also challenging due to low image qualities and 
viewpoint changes. 

3DPeS. The 3DPeS dataset [1] is comprised of 1012 images from 193 pedestrians captured by eight cameras, where each person has 2 to 26 images, as in Figures 4d and 5a. Note that since there are eight cameras with significantly different views in the dataset, in our experiments, we group cameras with similar views together and form three camera groups. Then, we train a correspondence structure between each pair of camera groups. Finally, three correspondence structures are obtained and utilized to perform Re-ID between different camera groups. For images from the same camera group, we simply utilize adjacency-constrained search [27] to find the patch-wise mapping and calculate the image matching score accordingly.

Road. The Road dataset is our own constructed dataset which includes 416 image pairs taken by two cameras, with camera A monitoring an exit region and camera B monitoring a road region, as in Figures 1 and 5g.^2 Since images in this dataset are taken from a realistic crowded road scene, the interference from severe occlusion and large pose variation significantly increases the difficulty of this dataset.

For all of the above datasets, we follow previous methods [7, 22, 25] and perform experiments under 50%-training and 50%-testing. All images are scaled to 128 x 48. The

^2 This dataset will be open to the public soon.
















patch size in our approach is 24 x 18. The stride size between neighboring patches is 6 horizontally and 8 vertically for probe images, and 3 horizontally and 4 vertically for gallery images. Note that we use a smaller stride size in gallery images in order to obtain more patches. In this way, we have more flexibility during patch-wise matching.
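As a worked check of how many patches these settings produce, the sketch below assumes that 128 x 48 and 24 x 18 are both given as height x width, and that patches are only placed where they fit entirely inside the image. Neither convention is stated explicitly in the text, and the function name is hypothetical.

```python
def grid_size(img_h, img_w, patch_h, patch_w, stride_v, stride_h):
    """Number of patch rows and columns on a regular sliding grid."""
    rows = (img_h - patch_h) // stride_v + 1
    cols = (img_w - patch_w) // stride_h + 1
    return rows, cols


# Probe images: stride 8 vertically, 6 horizontally.
probe_rows, probe_cols = grid_size(128, 48, 24, 18, stride_v=8, stride_h=6)
# Gallery images: stride 4 vertically, 3 horizontally.
gallery_rows, gallery_cols = grid_size(128, 48, 24, 18, stride_v=4, stride_h=3)

n_probe = probe_rows * probe_cols          # 14 * 6  = 84 patches
n_gallery = gallery_rows * gallery_cols    # 27 * 11 = 297 patches
```

Under these assumptions a correspondence structure would hold 84 x 297 matching probabilities, and the gallery grid offers roughly 3.5 times as many candidate patches as the probe grid.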

6.1. Results for patch matching 

We compare the patch matching results of three methods: (1) The adjacency-constrained search method [27, 26], which finds a best-matched patch for each patch in a probe image (probe patch) by searching a fixed neighborhood region around the probe patch's co-located patch in a gallery image (Adjacency-constrained). (2) The simple-average method, which simply averages the binary mapping structures for different probe images (as in Fig. 4a) to be the correspondence structure and combines it with a global constraint to find the best one-to-one patch matching result (Simple-average). (3) Our approach, which employs a boosting-based process to learn the correspondence structure and combines it with a global constraint to find the best one-to-one patch matching result.

Fig. 5 shows the patch mapping results of different meth¬ 
ods, where solid lines represent matching probabilities in 
a correspondence structure and red-dashed lines represent 
patch matching results. Besides, Figures 4e and 4f show one 
example of the correspondence structure matrix obtained by 
our approach and the simple-average method, respectively. 
From Figures 5 and 4e-4f, we can observe: 

(1) Since the adjacency-constrained method searches a fixed neighborhood region without considering the correspondence pattern between cameras, it may easily be misled by wrong patches with similar appearances in the neighborhood (cf. Figs. 5d, 5g). Comparatively, with the indicative matching probability information in the correspondence structure, the interference from mismatched patches can be effectively reduced (cf. Figs. 5f, 5i).

(2) When there are large misalignments between cameras, the adjacency-constrained method may fail to find proper patches as the correct patches may be located outside the neighborhood region, as in Fig. 5a. Comparatively, the large misalignment pattern between cameras can be properly captured by our correspondence structure, resulting in a more accurate patch matching result (cf. Fig. 5c).

(3) Comparing Figures 4e, 4f with the last two columns in Fig. 5, it is obvious that the correspondence structures by our approach are better than those of the simple-average method. Specifically, the correspondence structures by the simple-average method include many unsuitable matching probabilities which may easily result in wrong patch matches. In contrast, the correspondence structures by our approach are more coherent with the actual spatial correspondence pattern between cameras. This implies that a reliable correspondence structure cannot be easily achieved without suitably integrating the information cues between cameras.

Figure 5. Comparison of different patch mapping methods. Left column: the adjacency-constrained method; Middle column: the simple-average method; Last column: our approach. The solid lines represent matching probabilities in a correspondence structure and the red-dashed lines represent patch matching results. Note that the image pair in (a)-(c) includes the same person (i.e., correct match) while the image pairs in (d)-(i) include different people (i.e., wrong match). (Best viewed in color)

6.2. Results for person re-identification 

We evaluate person re-identification results by the standard
Cumulated Matching Characteristic (CMC) curve [21], which
measures the correct match rates within different Re-ID rank
ranges. The evaluation protocols are the same as in [7]. That
is, for each dataset, we perform 10 randomly partitioned
50%-training and 50%-testing experiments and average the
results.
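As an illustration of this protocol, the following sketch computes a CMC curve from a probe-by-gallery score matrix and averages curves over repeated splits. It assumes that each probe has exactly one correct gallery image and that higher scores mean better matches. The helper names (`cmc_curve`, `average_curves`, `random_half_split`) are hypothetical and are not taken from [7] or [21].

```python
import random


def cmc_curve(scores, true_gallery, max_rank):
    """scores[p][g]: matching score between probe p and gallery image g.
    true_gallery[p]: index of the correct gallery image for probe p.
    Returns c with c[k-1] = fraction of probes matched within the top k."""
    hits = [0] * max_rank
    for p, row in enumerate(scores):
        target = row[true_gallery[p]]
        # Rank of the correct match; ties are resolved optimistically.
        rank = 1 + sum(1 for s in row if s > target)
        for k in range(rank, max_rank + 1):
            hits[k - 1] += 1
    return [h / len(scores) for h in hits]


def average_curves(curves):
    """Element-wise mean of several CMC curves (e.g. one per split)."""
    return [sum(vals) / len(vals) for vals in zip(*curves)]


def random_half_split(person_ids, rng):
    """Randomly partition identities into 50% training and 50% testing."""
    ids = list(person_ids)
    rng.shuffle(ids)
    half = len(ids) // 2
    return ids[:half], ids[half:]


# Toy example: three probes whose correct matches land at ranks 1, 2 and 3.
scores = [[0.9, 0.1, 0.3],
          [0.2, 0.4, 0.8],
          [0.5, 0.6, 0.1]]
print(cmc_curve(scores, true_gallery=[0, 1, 2], max_rank=3))
train_ids, test_ids = random_half_split(range(10), random.Random(0))
```

In a full evaluation, each of the 10 splits would learn on its training half, produce one curve on its testing half, and `average_curves` would combine the 10 curves.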

We compare results of four methods: (1) not applying a
correspondence structure and directly using the appearance
similarity between co-located patches for person Re-ID
(No-structure); (2) simply averaging the binary mapping
structures for different probe images as the correspondence
structure and utilizing it for Re-ID (Simple-average);
(3) using the correspondence structure learned by our
approach, but without the global constraint when performing
Re-ID (No-global); (4) our approach (Proposed).
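The difference between the No-global and Proposed settings can be sketched as follows. The sketch assumes that the global constraint amounts to a one-to-one assignment between probe and gallery patches, and that each candidate pair is scored by its matching probability times its appearance similarity. Both are simplifying assumptions for illustration, and the exhaustive search is only feasible for a handful of patches. All names are hypothetical.

```python
from itertools import permutations


def weighted(sim, prob):
    """Element-wise product of similarity and matching probability."""
    return [[p * s for p, s in zip(prow, srow)]
            for prow, srow in zip(prob, sim)]


def match_independent(sim, prob):
    """No global constraint: every probe patch picks its own best
    gallery patch, so several probe patches may pick the same one."""
    return [max(range(len(row)), key=lambda j: row[j])
            for row in weighted(sim, prob)]


def match_one_to_one(sim, prob):
    """Assumed global constraint: exhaustive search for the one-to-one
    assignment with the largest total weighted similarity."""
    w = weighted(sim, prob)
    n = len(w)
    best = max(permutations(range(n)),
               key=lambda perm: sum(w[i][perm[i]] for i in range(n)))
    return list(best)


sim = [[0.9, 0.8, 0.1],
       [0.9, 0.2, 0.1],
       [0.1, 0.2, 0.7]]
prob = [[0.5, 0.5, 0.0],
        [0.6, 0.4, 0.0],
        [0.0, 0.2, 0.8]]
print(match_independent(sim, prob))  # probe patches 0 and 1 collide
print(match_one_to_one(sim, prob))   # collision resolved jointly
```

For realistic patch counts, the exhaustive search would be replaced by a polynomial-time assignment solver.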

We also compare our results with state-of-the-art methods on
different datasets: kLFDA [22], eSDC-ocsvm [27], KISSME [19],
Salience [26], svmml [13], RankBoost [11] and LF [18] on the
VIPeR dataset; KISSME [19], EIML [8], SCNCD [25],
SCNCDFinal [25] on the PRID 450S dataset; kLFDA [22],
rPCCA [22], PCCA [16] on the 3DPeS dataset; and eSDC-knn [27]
on the Road dataset.

Figure 6. CMC curves for different methods. (a) the VIPeR
dataset; (b) the PRID 450S dataset; (c) the 3DPeS dataset.

Table 1. CMC results on the VIPeR dataset

Rank              1      5      10     20     30     50
kLFDA [22]        32.3   65.8   79.7   90.9   -      -
KISSME [19]       27     -      70     83     -      95
Salience [26]     30.2   52.3   -      -      -      -
svmml [13]        30.1   63.2   77.4   88.1   -      -
RankBoost [11]    23.9   45.6   56.2   68.7   -      -
eSDC-ocsvm [27]   26.7   50.7   62.4   76.4   -      -
LF [18]           24.2   -      67.1   -      -      94.1
No-structure      27.5   57.0   73.7   83.9   87.7   94.3
Simple-average    28.5   57.9   74.1   84.2   88.3   94.6
No-global         30.8   62.7   77.5   88.9   91.7   95.6
Proposed          34.8   68.7   82.3   91.8   94.9   96.2

Tables 1-4 and Fig. 6 show the CMC results of different
methods. From the CMC results, we can see that: (1) Our
approach achieves better Re-ID performance than the
state-of-the-art methods. This demonstrates the effectiveness
of our approach. (2) Our approach clearly improves over the
no-structure method. This indicates that proper correspondence
structures can effectively improve Re-ID performance by
reducing patch-wise misalignments. (3) The simple-average
method has similar performance to the no-structure method.
This implies that unsuitably selected correspondence
structures cannot improve Re-ID performance. (4) The no-global
method also has good Re-ID performance. This further
demonstrates the effectiveness of the correspondence structure
learned by our approach. Meanwhile, our approach also
outperforms the no-global method. This demonstrates the
usefulness of introducing the global constraint in the patch
matching process.
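To make observations (2)-(4) concrete, the short script below computes rank-1 differences between the four compared settings. The numbers are copied from the rank-1 columns of Tables 1-4. The dictionary layout and the `rank1_gain` helper are hypothetical conveniences for the arithmetic.

```python
# Rank-1 CMC rates copied from Tables 1-4.
RANK1 = {
    "VIPeR":     {"No-structure": 27.5, "Simple-average": 28.5,
                  "No-global": 30.8, "Proposed": 34.8},
    "PRID 450S": {"No-structure": 39.6, "Simple-average": 38.2,
                  "No-global": 42.7, "Proposed": 44.4},
    "3DPeS":     {"No-structure": 51.6, "Simple-average": 50.5,
                  "No-global": 54.7, "Proposed": 57.9},
    "Road":      {"No-structure": 50.5, "Simple-average": 49.0,
                  "No-global": 58.2, "Proposed": 61.5},
}


def rank1_gain(dataset, method, baseline):
    """Rank-1 difference (method minus baseline), rounded to one decimal."""
    return round(RANK1[dataset][method] - RANK1[dataset][baseline], 1)


for name in RANK1:
    print(name,
          "structure + global:", rank1_gain(name, "Proposed", "No-structure"),
          "| global only:", rank1_gain(name, "Proposed", "No-global"),
          "| simple average:", rank1_gain(name, "Simple-average", "No-structure"))
```

On these figures, adding the global constraint to the learned structure raises rank 1 by 1.7 to 4.0 points, while the simple-average structure stays within 1.5 points of the no-structure baseline on every dataset.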

Table 2. CMC results on the PRID 450S dataset

Rank              1      5      10     20     30     50
KISSME [19]       33     -      71     79     -      90
EIML [8]          35     -      68     77     -      90
SCNCD [25]        41.5   66.6   75.9   84.4   88.4   92.4
SCNCDFinal [25]   41.6   68.9   79.4   87.8   91.8   95.4
No-structure      39.6   64.9   76.0   85.3   89.3   93.3
Simple-average    38.2   63.6   75.1   84.9   88.9   92.4
No-global         42.7   69.3   78.2   87.4   91.1   95.1
Proposed          44.4   71.6   82.2   89.8   93.3   96.0

Table 3. CMC results on the 3DPeS dataset

Rank              1      5      10     15     20     30
kLFDA [22]        54.0   77.7   85.9   -      92.4   -
rPCCA [22]        47.3   75.0   84.5   -      91.9   -
PCCA [16]         41.6   70.5   81.3   -      90.4   -
No-structure      51.6   75.8   84.2   88.4   90.5   92.6
Simple-average    50.5   74.7   83.2   87.4   89.5   92.6
No-global         54.7   77.9   87.4   90.5   91.6   93.7
Proposed          57.9   81.1   89.5   92.6   93.7   94.7

Table 4. CMC results on the Road dataset

Rank              1      5      10     15     20     30
eSDC-knn [27]     52.4   74.5   83.7   88.0   89.9   91.8
No-structure      50.5   80.3   87.0   91.3   94.2   95.7
Simple-average    49.0   81.7   90.4   92.8   95.7   96.2
No-global         58.2   85.6   94.2   97.1   98.1   98.6
Proposed          61.5   91.8   95.2   98.1   98.6   99.0

7. Conclusion

In this paper, we propose a novel framework for addressing
the problem of cross-view spatial misalignments in person
Re-ID. Our framework consists of two key ingredients:
1) introducing the idea of a correspondence structure and
learning this structure via a novel boosting method to adapt
to arbitrary camera configurations; 2) a constrained global
matching step to control the patch-wise misalignments between
images due to local appearance ambiguity. Extensive
experimental results on benchmark datasets show that our
approach achieves state-of-the-art performance.

Under this framework, our future work will explore new
variants of the two components, such as: 1) designing other
correspondence structure learning methods that allow for
multiple structure candidates to enhance flexibility;
2) devising and incorporating edge-to-edge similarity metrics
for solving the constrained global matching problem as graph
matching [4, 24], which has proven more effective in many
computer vision applications.